This document is submitted as a final project of Linear Regression and Modeling course as a part of Coursera Statistics with R specialiation. The project utilizes multiple linear regression to analyse the provided data.
Research question:
Is there linear relationship between the scores given by the audience and runtime of movie or the fact whether or not the movie director has ever won an Oscar?
How can we predict the audience score regarding these two variables?
The data set is comprised of 651 randomly sampled movies produced and released before 2016.
Inference generalizability and causality:
The random sampling used for the data collection allows us to generalize the results of the analysis to the population. However, as it is an observational study with no random assignment, we cannot assure the causality between variables.
More Info:
Rotten Tomatoes
IMDB
Is there linear relationship between the scores given by the audience and runtime of movie or the fact whether or not the movie director has ever won an Oscar?
How can we predict the audience score regarding these two variables?
The response variable for this analysis is the audience score(audience_score) and the two explanatory variables are the runtime of movie(runtime) and the fact whether or not the movie director ever won an Oscar(best_dir_win).
rm(list=ls())
# default r markdown global options in document
knitr::opts_chunk$set(
########## Text results ##########
echo=TRUE,
warning=FALSE, # to preserve warnings in the output
error=FALSE, # to preserve errors in the output
message=FALSE, # to preserve messages
strip.white=TRUE, # to remove the white lines in the beginning or end of a source chunk in the output
########## Cache ##########
cache=TRUE,
########## Plots ##########
fig.path="", # prefix to be used for figure filenames
fig.width=8,
fig.height=6,
dpi=200
)library(tidyverse) # load tidyverse library
library(ggplot2)
library(statsr) # load statsr library
#library(gplots) # load gplots library
library(knitr)knitr::opts_chunk$set(echo = TRUE,cache=TRUE)
setwd("~/Desktop/Coursera/Statistics_with_R/labs_Sunah/3.4_Assignment") # set the working directory
load("movies.Rdata")
df<-as.data.frame(movies) # assign brfss2013 as data frame with the name 'df'dim(df)## [1] 651 32
head(df,2) # first 2 rows of the data frame## title title_type genre runtime mpaa_rating studio
## 1 Filly Brown Feature Film Drama 80 R Indomina Media Inc.
## 2 The Dish Feature Film Drama 101 PG-13 Warner Bros. Pictures
## thtr_rel_year thtr_rel_month thtr_rel_day dvd_rel_year dvd_rel_month
## 1 2013 4 19 2013 7
## 2 2001 3 14 2001 8
## dvd_rel_day imdb_rating imdb_num_votes critics_rating critics_score
## 1 30 5.5 899 Rotten 45
## 2 28 7.3 12285 Certified Fresh 96
## audience_rating audience_score best_pic_nom best_pic_win best_actor_win
## 1 Upright 73 no no no
## 2 Upright 81 no no no
## best_actress_win best_dir_win top200_box director actor1
## 1 no no no Michael D. Olmos Gina Rodriguez
## 2 no no no Rob Sitch Sam Neill
## actor2 actor3 actor4 actor5
## 1 Jenni Rivera Lou Diamond Phillips Emilio Rivera Joseph Julian Soria
## 2 Kevin Harrington Patrick Warburton Tom Long Genevieve Mooy
## imdb_url
## 1 http://www.imdb.com/title/tt1869425/
## 2 http://www.imdb.com/title/tt0205873/
## rt_url
## 1 //www.rottentomatoes.com/m/filly_brown_2012/
## 2 //www.rottentomatoes.com/m/dish/
colnames(df)## [1] "title" "title_type" "genre"
## [4] "runtime" "mpaa_rating" "studio"
## [7] "thtr_rel_year" "thtr_rel_month" "thtr_rel_day"
## [10] "dvd_rel_year" "dvd_rel_month" "dvd_rel_day"
## [13] "imdb_rating" "imdb_num_votes" "critics_rating"
## [16] "critics_score" "audience_rating" "audience_score"
## [19] "best_pic_nom" "best_pic_win" "best_actor_win"
## [22] "best_actress_win" "best_dir_win" "top200_box"
## [25] "director" "actor1" "actor2"
## [28] "actor3" "actor4" "actor5"
## [31] "imdb_url" "rt_url"
df<-df %>%
select(audience_score, director, runtime, best_dir_win) %>%
arrange(director) %>%
na.omit
ggplot(data=df, aes(runtime))+geom_histogram(aes(runtime),alpha=0.4, bins=30)+theme_bw(base_family="Avenir")+labs(x="Runtime in minutes", y="Count", title="")+guides(color=guide_legend(title="Director ever won an Oscar"))+theme(legend.position="bottom")+geom_vline(xintercept=mean(df$runtime), linetype="dotted",color="black")ggplot(data=df, aes(x=runtime, y=audience_score))+geom_jitter(alpha=.5, aes(color=best_dir_win))+theme_bw(base_family="Avenir")+labs(x="Runtime in minutes", y="Audience score", title="")+guides(color=guide_legend(title="Director ever won an Oscar"))+theme(legend.position="bottom")+geom_smooth(method="lm", size=0.5, color="grey")ggplot(data=df, aes(x=runtime,y=audience_score, color=best_dir_win))+geom_boxplot()+theme_bw(base_family="Avenir")+facet_wrap(.~best_dir_win)+labs(x="Runtime in minutes", y="Audience score", title="")+guides(color=guide_legend(title="Director ever won an Oscar"))+theme(legend.position="bottom")summary(df$runtime)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 39.0 92.0 102.5 105.7 115.0 267.0
df_no<-df %>% filter(best_dir_win=="no")
df_yes<-df %>% filter(best_dir_win=="yes")
summary(df_no);summary(df_yes)## audience_score director runtime best_dir_win
## Min. :11.00 Length:605 Min. : 39.0 no :605
## 1st Qu.:45.00 Class :character 1st Qu.: 92.0 yes: 0
## Median :65.00 Mode :character Median :101.0
## Mean :61.82 Mean :104.4
## 3rd Qu.:79.00 3rd Qu.:114.0
## Max. :96.00 Max. :267.0
## audience_score director runtime best_dir_win
## Min. :27.00 Length:43 Min. : 84 no : 0
## 1st Qu.:53.00 Class :character 1st Qu.:106 yes:43
## Median :73.00 Mode :character Median :122
## Mean :69.51 Mean :124
## 3rd Qu.:85.50 3rd Qu.:138
## Max. :97.00 Max. :202
Findings from EDA:
The interquartile range of the movie runtime is 92 minutes to 115 minutes. The mean value of the runtime is 105 minutes as illustrated with the dotted line in the histogram. The audience score in the interquartile range of runtime varies quite a lot as shown in the point plot. We can already observe that there is a weak linearity between the audience score and runtime, but the study will still continue the linear modeling with the aim to learn from the results.
In total there are 43 movies which are filmed by directors who have ever won Oscar(From now on, this group will be called as Yes-Oscar). The rest of 605 movies were filmed by those who have never won Oscar(From now on, this group will be called as No-Oscar). The range of audience score is wider in No-Oscar groups(11-96) compared to the Yes-Oscar (27-97). The mean and the median values of Yes-Oscar movies are higher than the No-Oscar movies. However, as the boxplot illustrates, there could exist some dependency between runtime and Oscar won.
fit<-lm(audience_score~runtime+best_dir_win, data=df) # multiple linear model
summary(fit)##
## Call:
## lm(formula = audience_score ~ runtime + best_dir_win, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.185 -16.135 3.004 17.380 34.924
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.58806 4.42561 9.849 < 2e-16 ***
## runtime 0.17455 0.04166 4.190 3.17e-05 ***
## best_dir_winyes 4.28297 3.24821 1.319 0.188
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.93 on 645 degrees of freedom
## Multiple R-squared: 0.03521, Adjusted R-squared: 0.03222
## F-statistic: 11.77 on 2 and 645 DF, p-value: 9.527e-06
anova(fit)## Analysis of Variance Table
##
## Response: audience_score
## Df Sum Sq Mean Sq F value Pr(>F)
## runtime 1 8656 8656.4 21.8036 3.676e-06 ***
## best_dir_win 1 690 690.3 1.7386 0.1878
## Residuals 645 256075 397.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For the model selection, the stepwise backward p-value elimination model was used. The process of the model starts with the full model. If the p-value is above the significant level, here we consider 0.05, then that variable will be dropped. Afterwards we refit the model as well as repeat the process. If the largest p-value is less than the significance level, we would not eliminate any predictors and the current model would be the best-fitting model. As the p-value of best_dir_win from the full model is greater than the significant level of 0.05, we drop this variable and refit the model.
fit2<-lm(audience_score~runtime, data=df) # refitting after dropping best_dir_win
summary(fit2)##
## Call:
## lm(formula = audience_score ~ runtime, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.638 -15.711 3.011 17.126 34.953
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.41771 4.33816 9.778 < 2e-16 ***
## runtime 0.18831 0.04035 4.667 3.72e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.94 on 646 degrees of freedom
## Multiple R-squared: 0.03261, Adjusted R-squared: 0.03112
## F-statistic: 21.78 on 1 and 646 DF, p-value: 3.722e-06
anova(fit2)## Analysis of Variance Table
##
## Response: audience_score
## Df Sum Sq Mean Sq F value Pr(>F)
## runtime 1 8656 8656.4 21.779 3.722e-06 ***
## Residuals 646 256765 397.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is now less than chosen significance level of 0.05. Therefore, we can choose this as our final model and accordingly we can set the final equation of our model as follows: \[ {score_{audience}}=42.42+0.19\times{runtime} \]
Alternatively, we can use the built-in step function using AIC strategy from statsr package for the model selection.
step<-step(fit, direction="backward", trace=TRUE, steps=100) # Choosing a model by AIC in a stepwise algorithm## Start: AIC=3880.61
## audience_score ~ runtime + best_dir_win
##
## Df Sum of Sq RSS AIC
## - best_dir_win 1 690.3 256765 3880.4
## <none> 256075 3880.6
## - runtime 1 6971.4 263046 3896.0
##
## Step: AIC=3880.35
## audience_score ~ runtime
##
## Df Sum of Sq RSS AIC
## <none> 256765 3880.4
## - runtime 1 8656.4 265421 3899.8
It yields the same model selection.
Now we diagnose our final model according to the following conditions.
Condition 1. Linear relationship between variables The explanatory variable is linearly related to the response variable. For that we look for a random scatter around 0.
df_con<-as.data.frame(cbind(fit2$residuals,df$runtime))
colnames(df_con)<-c("residuals","runtime")
ggplot(df_con,aes(x=runtime,y=residuals))+geom_point(alpha=0.5)+theme_bw(base_family="Avenir")+geom_hline(yintercept=0, color="red",linetype="solid",size=0.2) Diagnostic: As shortly seen in the exploratory data analysis above, the explanatory and response variable does not show a linear trend.
Condition 2. Nearly normal residuals with mean 0
ggplot(df_con,aes(residuals))+geom_histogram()ggplot(df_con, aes(sample=residuals))+geom_qq(size=0.5)+geom_qq_line(size=0.5) Diagnostic: There is a left skew.
Condition 3. Constant variability of residuals Residuals should be equally variable for low and high values of the predicted response variable. Residuals should be randomly scattered in a band with a constant width around 0.
# Checking using residual plots of residuals vs. predicted:
ggplot(df_con,aes(x=fit2$fitted.values, y=residuals))+geom_point(alpha=0.5)+geom_hline(yintercept=0, color="red",linetype="solid",size=0.2)# Viewing absolute value of residuals vs. predicted to identify unusual observations:
ggplot(df_con,aes(x=fit2$fitted.values, y=abs(residuals)))+geom_point(alpha=0.5)+geom_hline(yintercept=0, color="red",linetype="solid",size=0.2) Diagnostic: The variability of residuals is non-constant.
Condition 4. Independence of residuals
ggplot(df_con, aes(x=runtime, y=residuals))+geom_point() Diagnostic: The observations are not independent.
We perform a prediction to test the final model.
newmovie<-data.frame(runtime=seq(30,300,15)) # Data frame of movies with runtimes between 30 min and 300 min with 15 minutes of interval.
pred<-predict.lm(fit2,newmovie, interval="prediction", level=0.95) # Prediction of the model
ndf<-data.frame(runtime=seq(30,300,15),
fit<-pred[,1],
lwr<-pred[,2],
upr<-pred[,3])
colnames(ndf)<-c("runtime","Expected score","lwr","upr"); ndf## runtime Expected score lwr upr
## 1 30 48.06699 8.431365 87.70260
## 2 45 50.89162 11.418483 90.36476
## 3 60 53.71626 14.370353 93.06217
## 4 75 56.54090 17.286634 95.79517
## 5 90 59.36554 20.167075 98.56400
## 6 105 62.19018 23.011524 101.36883
## 7 120 65.01482 25.819926 104.20971
## 8 135 67.83945 28.592326 107.08658
## 9 150 70.66409 31.328867 109.99932
## 10 165 73.48873 34.029790 112.94767
## 11 180 76.31337 36.695427 115.93131
## 12 195 79.13801 39.326203 118.94981
## 13 210 81.96265 41.922622 122.00267
## 14 225 84.78729 44.485270 125.08930
## 15 240 87.61192 47.014799 128.20905
## 16 255 90.43656 49.511926 131.36120
## 17 270 93.26120 51.977422 134.54498
## 18 285 96.08584 54.412106 137.75957
## 19 300 98.91048 56.816833 141.00412
As an example, the model predicts for a movie with 120 minutes of runtime to have an audience score of 65. With 95% confidence we can say that the movie is expected to have an audience score from 25.8 to 100 (the upper limit).
Through the multiple linear regression model and the executed model selection it was found that whether or not the movie director has ever won an Oscar is not a significant variable for the audience score.
The fitted model for runtime and audience score of the movie yields an final equation of \[ {score_{audience}}=42.42+0.19\times{runtime} \]
This is illustrated in the figure below.
ggplot(df, aes(x=runtime, y=audience_score))+geom_point(alpha=0.05)+geom_smooth(method="lm")+theme_bw(base_family="Avenir")+labs(x="Runtime in minutes", y="Audience score", title="")+guides(color=guide_legend(title="Director ever won an Oscar"))+theme(legend.position="bottom")As it could be inferred in the model diagnostics, the final model has very low R-squared (0.03), which demonstrates that merely 3% of the variability of response variable(audience score) as regards the explanatory variable(runtime) can be explained by the model. Primarily the weak linearity between the explanatory variable and the response variable resulted in the poor fit of the model.
Nonetheless, this study demonstrated the results of linear regression when the conditions for the least squares line is not met.